Mapping of Sequence Reads to the Reference Genomes    ◾    65

information. Most aligners are capable of performing both exact matching and inexact

matching, which are essential to find the locations of reads that may have some base call

errors or vary genetically from the reference genome. Different aligners implement

different algorithms to perform both kinds of lookup in the indexed reference genome,

which is stored in data structures such as a suffix tree, suffix array, hash table, or BWT. While the

exact lookup is straightforward, inexact matching uses sequence similarity to find the

most likely locations from which a read originated. Although there are different ways to

measure sequence similarity, most aligners use the Hamming distance [12] or the Levenshtein

distance [13] to score the similarity between a read and portions of the reference genome

against a threshold. Some aligners use the seed-and-extend strategy to extend a seed (an

exactly matching substring) across mismatched bases, which allows reads with

base call errors or variations to be mapped. Most aligners apply the seed-and-extend strategy to local

sequence alignment using the SW algorithm. Seeds are created by extracting overlapping k-mers

(substrings or words of length k) from the reference genome sequence. Some aligners, such as

Novoalign [14] and SOAP [15], index k-mers with trie or hash table data structures for

a quick search.
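
The ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the implementation used by any particular aligner: the function names (build_kmer_index, hamming, levenshtein, map_read), the toy reference sequence, and the parameter values are all assumptions made for illustration. The sketch indexes overlapping k-mers of the reference in a hash table, looks up each k-mer of the read as a seed, and then checks the full read at the implied position with the Hamming distance against a mismatch threshold. The Hamming check stands in for the extension step and handles substitutions only; an extension based on local alignment would also tolerate insertions and deletions, which is what the Levenshtein distance counts.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping each overlapping k-mer to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read, reference, index, k, max_mismatches):
    """Return sorted (start, mismatches) pairs that pass the threshold."""
    hits = {}
    seen = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset  # where the whole read would begin
            if start in seen:
                continue
            seen.add(start)
            if start < 0 or start + len(read) > len(reference):
                continue
            distance = hamming(read, reference[start:start + len(read)])
            if distance <= max_mismatches:
                hits[start] = distance
    return sorted(hits.items())


# Toy data (hypothetical): the read differs from reference[8:18] at one base.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
print(map_read("TAGCCGTTAG", reference, index, k=4, max_mismatches=1))
print(levenshtein("ACGT", "AGT"))
```

With these toy inputs, the read is placed at position 8 (0-based) with one mismatch, and the Levenshtein distance between "ACGT" and "AGT" is 1 because a single deletion turns one into the other.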

2.3.1  SAM and BAM File Formats

Almost all read-aligning programs (aligners) store the alignment information of the reads

mapped to the reference genome in a Sequence Alignment/Map (SAM) file or a Binary

Alignment/Map (BAM) file, which is the binary form of SAM. The SAM file is a human-readable

plain text file for storing biological sequences, mainly those aligned to a reference sequence

[16], but it can also contain unmapped reads. It is a TAB-delimited text file consisting of

two main sections: (i) a header section and (ii) an alignment section.
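
As a rough illustration of this two-section layout, the following Python sketch splits SAM text into header lines and alignment records. The function name split_sam and the example records are made up for illustration. The sketch relies on two details of the SAM specification rather than of this section: header lines begin with "@", and bit 0x4 of the FLAG field (the second column) marks an unmapped read. In practice a dedicated library would normally be used instead of manual parsing.

```python
# Hypothetical SAM content: two header lines, one mapped and one unmapped read.
SAM_TEXT = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t101\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCATTGCA\tIIIIIIIIII\n"
)


def split_sam(text):
    """Separate header lines from TAB-delimited alignment records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line.split("\t"))
    return header, alignments


header, alignments = split_sam(SAM_TEXT)

# Bit 0x4 of the FLAG column indicates that the read is unmapped.
unmapped = [fields[0] for fields in alignments if int(fields[1]) & 0x4]

print(len(header), "header lines,", len(alignments), "alignment records")
print("unmapped reads:", unmapped)
```

Each alignment record in this example has the eleven mandatory TAB-separated fields defined by the SAM specification; optional fields, when present, would follow them on the same line.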

The header section of the SAM file is optional; when present, it must come before

the alignment section. Each line in the header section must start with the “@” symbol followed

FIGURE 2.14  A header section of a SAM file.